{
 "cells": [
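  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project owner: 胡兆杰\n",
    "## Date: 2020/07\n",
    "## Data source: https://www.liepin.com/zhaopin/\n",
    "# Goal:\n",
    "#### Scrape product manager job openings in Guangzhou that are relevant to bachelor's degree graduates, to give undergraduates more information about product manager roles and a reference for their job search."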
   ]
  },
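  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Scraping product manager job openings in Guangzhou for bachelor's degree graduates\n",
    "### The data comes from Liepin (猎聘网)\n",
    "#### Scrape the product manager job openings in Guangzhou that are available to bachelor's degree graduates"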
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XPath syntax reference: https://www.w3cschool.cn/xpath/xpath-syntax.html\n",
    "\n",
    "pandas basics: https://www.cnblogs.com/pfeiliu/p/12903211.html\n",
    "\n",
    "The requests-html scraping tool: https://www.cnblogs.com/fnng/p/8948015.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>edu</th>\n",
       "      <th>经验</th>\n",
       "      <th>薪水</th>\n",
       "      <th>时间</th>\n",
       "      <th>职称</th>\n",
       "      <th>公司地点</th>\n",
       "      <th>公司名称</th>\n",
       "      <th>链结</th>\n",
       "      <th>公司URL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>10-20k·12薪</td>\n",
       "      <td>2020年07月16日</td>\n",
       "      <td>小家电产品经理</td>\n",
       "      <td>广州-海珠区</td>\n",
       "      <td>广州海葳特科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1927108427.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9189185/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>12-20k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理（直播/社交方向）</td>\n",
       "      <td>广州</td>\n",
       "      <td>广东映客互娱网络信息有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1930041349.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9906371/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>硕士及以上</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>7-8k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州龙之杰科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1929557555.shtml</td>\n",
       "      <td>https://www.liepin.com/company/5279579/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>8-23k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品总监/经理(吸尘器及清洁设备)</td>\n",
       "      <td>广州</td>\n",
       "      <td>杰诺智能科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1929443381.shtml</td>\n",
       "      <td>https://www.liepin.com/company/12288013/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>16-25k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>高级产品经理</td>\n",
       "      <td>广州-海珠区</td>\n",
       "      <td>信用生活(广州)智能科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1929278189.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9512616/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>1-3年</td>\n",
       "      <td>13-18k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>中级产品经理</td>\n",
       "      <td>广州-海珠区</td>\n",
       "      <td>信用生活(广州)智能科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1929278163.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9512616/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>广州龙之杰科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1928833897.shtml</td>\n",
       "      <td>https://www.liepin.com/company/5279579/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>学历不限</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>5-6k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理（养老康复）</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州龙之杰科技有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1928833893.shtml</td>\n",
       "      <td>https://www.liepin.com/company/5279579/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>1-3年</td>\n",
       "      <td>10-18k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理（无人机）</td>\n",
       "      <td>广州-五山</td>\n",
       "      <td>广东国地规划科技股份有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1928494651.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9424014/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>BIM产品经理</td>\n",
       "      <td>广州-五山</td>\n",
       "      <td>广东国地规划科技股份有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1928088905.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9424014/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>面议</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>儿保健产品总监/经理(J10118)</td>\n",
       "      <td>广州</td>\n",
       "      <td>榄菊集团</td>\n",
       "      <td>https://www.liepin.com/job/1927625117.shtml</td>\n",
       "      <td>https://www.liepin.com/company/2744156/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>大专及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>7-14k·15薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>高级地区经理（诊断产品销售）</td>\n",
       "      <td>广州</td>\n",
       "      <td>科华生物工程</td>\n",
       "      <td>https://www.liepin.com/job/1927583025.shtml</td>\n",
       "      <td>https://www.liepin.com/company/3196133/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>20-30k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>直播产品运营经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>上海翡翠东方网络信息技术有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1927571033.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9947855/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>10-16k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理/项目经理（无人机）</td>\n",
       "      <td>广州</td>\n",
       "      <td>广东国地规划科技股份有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1927398855.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9424014/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>15-20k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>GIS产品经理</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>广东国地规划科技股份有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1926820317.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9424014/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>20-30k·15薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理（营收）</td>\n",
       "      <td>广州</td>\n",
       "      <td>上海翡翠东方网络信息技术有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1926712533.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9947855/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>10年以上</td>\n",
       "      <td>30-50k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>智能财务产品经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>金蝶</td>\n",
       "      <td>https://www.liepin.com/job/1924573173.shtml</td>\n",
       "      <td>https://www.liepin.com/company/1634243/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>大专及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>16-33k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品营销策划经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州美粤文化传播有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1924217811.shtml</td>\n",
       "      <td>https://www.liepin.com/company/10179103/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>20-30k·15薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>直播营收产品运营经理</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>上海翡翠东方网络信息技术有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1924139323.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9947855/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>1-3年</td>\n",
       "      <td>18-30k·13薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>TMS产品经理 (MJ001313)</td>\n",
       "      <td>广州-番禺区</td>\n",
       "      <td>SHEIN</td>\n",
       "      <td>https://www.liepin.com/job/1922790939.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9857585/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>18-30k·13薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>WMS产品经理 (MJ001309)</td>\n",
       "      <td>广州-番禺区</td>\n",
       "      <td>SHEIN</td>\n",
       "      <td>https://www.liepin.com/job/1922751905.shtml</td>\n",
       "      <td>https://www.liepin.com/company/9857585/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>大专及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>8-12k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>大气环境监测产品大项目销售经理（广东）</td>\n",
       "      <td>广州-东圃</td>\n",
       "      <td>无锡中科光电技术有限公司</td>\n",
       "      <td>https://www.liepin.com/job/1922585321.shtml</td>\n",
       "      <td>https://www.liepin.com/company/3944024/</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>学历不限</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>8-12k·13薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品开发经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>知名化妆品公司</td>\n",
       "      <td>https://www.liepin.com/a/21327999.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>28-35k·16薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品寻源引进经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>知名国内快消品牌</td>\n",
       "      <td>https://www.liepin.com/a/21294245.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>学历不限</td>\n",
       "      <td>1-3年</td>\n",
       "      <td>25-35k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理/高级产品经理</td>\n",
       "      <td>广州-番禺区</td>\n",
       "      <td>互联网</td>\n",
       "      <td>https://www.liepin.com/a/21268199.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>35-65k·16薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>高级产品经理</td>\n",
       "      <td></td>\n",
       "      <td>国内500强企业</td>\n",
       "      <td>https://www.liepin.com/a/21245609.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>学历不限</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品开发经理</td>\n",
       "      <td>广州-越秀区</td>\n",
       "      <td>知名化妆品公司</td>\n",
       "      <td>https://www.liepin.com/a/21241065.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>30-50k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>高级产品经理</td>\n",
       "      <td>广州,北京,上海</td>\n",
       "      <td>深圳某知名互联网公司</td>\n",
       "      <td>https://www.liepin.com/a/21236935.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>20-40k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>某车联网公司</td>\n",
       "      <td>https://www.liepin.com/a/21219993.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>20-40k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>硬件产品经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>某车联网公司</td>\n",
       "      <td>https://www.liepin.com/a/21219973.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>45-60k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>高级策略产品经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>互联网媒体公司</td>\n",
       "      <td>https://www.liepin.com/a/21161513.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>20-30k·20薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>金融产品经理（B端）</td>\n",
       "      <td>广州</td>\n",
       "      <td>某股份制银行</td>\n",
       "      <td>https://www.liepin.com/a/21150207.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>15-20k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品市场经理</td>\n",
       "      <td>广州</td>\n",
       "      <td>某软件公司</td>\n",
       "      <td>https://www.liepin.com/a/21147143.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>3-5年</td>\n",
       "      <td>15-20k·12薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>金融产品经理</td>\n",
       "      <td>广州-海珠区,广州-天河区</td>\n",
       "      <td>广州某金融公司</td>\n",
       "      <td>https://www.liepin.com/a/21146005.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>35-50k·16薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品策划经理-互通</td>\n",
       "      <td>深圳,广州,成都</td>\n",
       "      <td>国内知名互联网企业</td>\n",
       "      <td>https://www.liepin.com/a/21079175.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>20-30k·14薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品（灯饰照明）经理/高级经理</td>\n",
       "      <td>中山,深圳,广州</td>\n",
       "      <td>国内照明行业领军品牌</td>\n",
       "      <td>https://www.liepin.com/a/21044941.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>大专及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>30-50k·15薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>运营商BSS、OSS产品规划经理产品规划总监（部门负责人）</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州某物联网，大数据，企业信息化集成商</td>\n",
       "      <td>https://www.liepin.com/a/21037937.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>大专及以上</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>18-30k·13薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>佛山,广州,中山</td>\n",
       "      <td>上市电器公司</td>\n",
       "      <td>https://www.liepin.com/a/21017491.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>本科及以上</td>\n",
       "      <td>经验不限</td>\n",
       "      <td>15-25k·15薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品营销经理</td>\n",
       "      <td>宁波,广州,上海</td>\n",
       "      <td>某知名家电企业</td>\n",
       "      <td>https://www.liepin.com/a/21014979.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>统招本科</td>\n",
       "      <td>5-10年</td>\n",
       "      <td>20-35k·16薪</td>\n",
       "      <td>2020年07月19日</td>\n",
       "      <td>产品寻源引进经理</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>国内知名快消企业</td>\n",
       "      <td>https://www.liepin.com/a/21006503.shtml</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      edu     经验          薪水           时间                             职称  \\\n",
       "0   本科及以上   3-5年  10-20k·12薪  2020年07月16日                        小家电产品经理   \n",
       "1   本科及以上   3-5年  12-20k·12薪  2020年07月19日                  产品经理（直播/社交方向）   \n",
       "2   硕士及以上   经验不限    7-8k·12薪  2020年07月19日                           产品经理   \n",
       "3    统招本科   3-5年   8-23k·12薪  2020年07月19日              产品总监/经理(吸尘器及清洁设备)   \n",
       "4   本科及以上   3-5年  16-25k·14薪  2020年07月19日                         高级产品经理   \n",
       "5   本科及以上   1-3年  13-18k·14薪  2020年07月19日                         中级产品经理   \n",
       "6   本科及以上   经验不限  10-15k·12薪  2020年07月19日                           产品经理   \n",
       "7    学历不限   经验不限    5-6k·12薪  2020年07月19日                     产品经理（养老康复）   \n",
       "8   本科及以上   1-3年  10-18k·12薪  2020年07月19日                      产品经理（无人机）   \n",
       "9    统招本科   3-5年  10-15k·12薪  2020年07月19日                        BIM产品经理   \n",
       "10   统招本科  5-10年          面议  2020年07月19日             儿保健产品总监/经理(J10118)   \n",
       "11  大专及以上   3-5年   7-14k·15薪  2020年07月19日                 高级地区经理（诊断产品销售）   \n",
       "12  本科及以上   经验不限  20-30k·12薪  2020年07月19日                       直播产品运营经理   \n",
       "13  本科及以上   3-5年  10-16k·12薪  2020年07月19日                 产品经理/项目经理（无人机）   \n",
       "14  本科及以上   3-5年  15-20k·12薪  2020年07月19日                        GIS产品经理   \n",
       "15  本科及以上   3-5年  20-30k·15薪  2020年07月19日                       产品经理（营收）   \n",
       "16  本科及以上  10年以上  30-50k·14薪  2020年07月19日                       智能财务产品经理   \n",
       "17  大专及以上  5-10年  16-33k·12薪  2020年07月19日                       产品营销策划经理   \n",
       "18  本科及以上   3-5年  20-30k·15薪  2020年07月19日                     直播营收产品运营经理   \n",
       "19   统招本科   1-3年  18-30k·13薪  2020年07月19日             TMS产品经理 (MJ001313)   \n",
       "20   统招本科   3-5年  18-30k·13薪  2020年07月19日             WMS产品经理 (MJ001309)   \n",
       "21  大专及以上  5-10年   8-12k·12薪  2020年07月19日            大气环境监测产品大项目销售经理（广东）   \n",
       "22   学历不限  5-10年   8-12k·13薪  2020年07月19日                         产品开发经理   \n",
       "23  本科及以上   经验不限  28-35k·16薪  2020年07月19日                       产品寻源引进经理   \n",
       "24   学历不限   1-3年  25-35k·12薪  2020年07月19日                    产品经理/高级产品经理   \n",
       "25   统招本科  5-10年  35-65k·16薪  2020年07月19日                         高级产品经理   \n",
       "26   学历不限  5-10年  10-15k·12薪  2020年07月19日                         产品开发经理   \n",
       "27   统招本科   3-5年  30-50k·14薪  2020年07月19日                         高级产品经理   \n",
       "28   统招本科  5-10年  20-40k·14薪  2020年07月19日                           产品经理   \n",
       "29  本科及以上  5-10年  20-40k·14薪  2020年07月19日                         硬件产品经理   \n",
       "30   统招本科  5-10年  45-60k·12薪  2020年07月19日                       高级策略产品经理   \n",
       "31   统招本科  5-10年  20-30k·20薪  2020年07月19日                     金融产品经理（B端）   \n",
       "32   统招本科  5-10年  15-20k·12薪  2020年07月19日                         产品市场经理   \n",
       "33   统招本科   3-5年  15-20k·12薪  2020年07月19日                         金融产品经理   \n",
       "34   统招本科   经验不限  35-50k·16薪  2020年07月19日                      产品策划经理-互通   \n",
       "35  本科及以上  5-10年  20-30k·14薪  2020年07月19日                产品（灯饰照明）经理/高级经理   \n",
       "36  大专及以上  5-10年  30-50k·15薪  2020年07月19日  运营商BSS、OSS产品规划经理产品规划总监（部门负责人）   \n",
       "37  大专及以上  5-10年  18-30k·13薪  2020年07月19日                           产品经理   \n",
       "38  本科及以上   经验不限  15-25k·15薪  2020年07月19日                         产品营销经理   \n",
       "39   统招本科  5-10年  20-35k·16薪  2020年07月19日                       产品寻源引进经理   \n",
       "\n",
       "             公司地点                 公司名称  \\\n",
       "0          广州-海珠区          广州海葳特科技有限公司   \n",
       "1              广州       广东映客互娱网络信息有限公司   \n",
       "2              广州          广州龙之杰科技有限公司   \n",
       "3              广州           杰诺智能科技有限公司   \n",
       "4          广州-海珠区     信用生活(广州)智能科技有限公司   \n",
       "5          广州-海珠区     信用生活(广州)智能科技有限公司   \n",
       "6          广州-黄埔区          广州龙之杰科技有限公司   \n",
       "7              广州          广州龙之杰科技有限公司   \n",
       "8           广州-五山       广东国地规划科技股份有限公司   \n",
       "9           广州-五山       广东国地规划科技股份有限公司   \n",
       "10             广州                 榄菊集团   \n",
       "11             广州               科华生物工程   \n",
       "12             广州     上海翡翠东方网络信息技术有限公司   \n",
       "13             广州       广东国地规划科技股份有限公司   \n",
       "14         广州-天河区       广东国地规划科技股份有限公司   \n",
       "15             广州     上海翡翠东方网络信息技术有限公司   \n",
       "16             广州                   金蝶   \n",
       "17             广州         广州美粤文化传播有限公司   \n",
       "18         广州-天河区     上海翡翠东方网络信息技术有限公司   \n",
       "19         广州-番禺区                SHEIN   \n",
       "20         广州-番禺区                SHEIN   \n",
       "21          广州-东圃         无锡中科光电技术有限公司   \n",
       "22             广州              知名化妆品公司   \n",
       "23             广州             知名国内快消品牌   \n",
       "24         广州-番禺区                  互联网   \n",
       "25                            国内500强企业   \n",
       "26         广州-越秀区              知名化妆品公司   \n",
       "27       广州,北京,上海           深圳某知名互联网公司   \n",
       "28             广州               某车联网公司   \n",
       "29             广州               某车联网公司   \n",
       "30             广州              互联网媒体公司   \n",
       "31             广州               某股份制银行   \n",
       "32             广州                某软件公司   \n",
       "33  广州-海珠区,广州-天河区              广州某金融公司   \n",
       "34       深圳,广州,成都            国内知名互联网企业   \n",
       "35       中山,深圳,广州           国内照明行业领军品牌   \n",
       "36             广州  广州某物联网，大数据，企业信息化集成商   \n",
       "37       佛山,广州,中山               上市电器公司   \n",
       "38       宁波,广州,上海              某知名家电企业   \n",
       "39         广州-天河区             国内知名快消企业   \n",
       "\n",
       "                                             链结  \\\n",
       "0   https://www.liepin.com/job/1927108427.shtml   \n",
       "1   https://www.liepin.com/job/1930041349.shtml   \n",
       "2   https://www.liepin.com/job/1929557555.shtml   \n",
       "3   https://www.liepin.com/job/1929443381.shtml   \n",
       "4   https://www.liepin.com/job/1929278189.shtml   \n",
       "5   https://www.liepin.com/job/1929278163.shtml   \n",
       "6   https://www.liepin.com/job/1928833897.shtml   \n",
       "7   https://www.liepin.com/job/1928833893.shtml   \n",
       "8   https://www.liepin.com/job/1928494651.shtml   \n",
       "9   https://www.liepin.com/job/1928088905.shtml   \n",
       "10  https://www.liepin.com/job/1927625117.shtml   \n",
       "11  https://www.liepin.com/job/1927583025.shtml   \n",
       "12  https://www.liepin.com/job/1927571033.shtml   \n",
       "13  https://www.liepin.com/job/1927398855.shtml   \n",
       "14  https://www.liepin.com/job/1926820317.shtml   \n",
       "15  https://www.liepin.com/job/1926712533.shtml   \n",
       "16  https://www.liepin.com/job/1924573173.shtml   \n",
       "17  https://www.liepin.com/job/1924217811.shtml   \n",
       "18  https://www.liepin.com/job/1924139323.shtml   \n",
       "19  https://www.liepin.com/job/1922790939.shtml   \n",
       "20  https://www.liepin.com/job/1922751905.shtml   \n",
       "21  https://www.liepin.com/job/1922585321.shtml   \n",
       "22      https://www.liepin.com/a/21327999.shtml   \n",
       "23      https://www.liepin.com/a/21294245.shtml   \n",
       "24      https://www.liepin.com/a/21268199.shtml   \n",
       "25      https://www.liepin.com/a/21245609.shtml   \n",
       "26      https://www.liepin.com/a/21241065.shtml   \n",
       "27      https://www.liepin.com/a/21236935.shtml   \n",
       "28      https://www.liepin.com/a/21219993.shtml   \n",
       "29      https://www.liepin.com/a/21219973.shtml   \n",
       "30      https://www.liepin.com/a/21161513.shtml   \n",
       "31      https://www.liepin.com/a/21150207.shtml   \n",
       "32      https://www.liepin.com/a/21147143.shtml   \n",
       "33      https://www.liepin.com/a/21146005.shtml   \n",
       "34      https://www.liepin.com/a/21079175.shtml   \n",
       "35      https://www.liepin.com/a/21044941.shtml   \n",
       "36      https://www.liepin.com/a/21037937.shtml   \n",
       "37      https://www.liepin.com/a/21017491.shtml   \n",
       "38      https://www.liepin.com/a/21014979.shtml   \n",
       "39      https://www.liepin.com/a/21006503.shtml   \n",
       "\n",
       "                                       公司URL  \n",
       "0    https://www.liepin.com/company/9189185/  \n",
       "1    https://www.liepin.com/company/9906371/  \n",
       "2    https://www.liepin.com/company/5279579/  \n",
       "3   https://www.liepin.com/company/12288013/  \n",
       "4    https://www.liepin.com/company/9512616/  \n",
       "5    https://www.liepin.com/company/9512616/  \n",
       "6    https://www.liepin.com/company/5279579/  \n",
       "7    https://www.liepin.com/company/5279579/  \n",
       "8    https://www.liepin.com/company/9424014/  \n",
       "9    https://www.liepin.com/company/9424014/  \n",
       "10   https://www.liepin.com/company/2744156/  \n",
       "11   https://www.liepin.com/company/3196133/  \n",
       "12   https://www.liepin.com/company/9947855/  \n",
       "13   https://www.liepin.com/company/9424014/  \n",
       "14   https://www.liepin.com/company/9424014/  \n",
       "15   https://www.liepin.com/company/9947855/  \n",
       "16   https://www.liepin.com/company/1634243/  \n",
       "17  https://www.liepin.com/company/10179103/  \n",
       "18   https://www.liepin.com/company/9947855/  \n",
       "19   https://www.liepin.com/company/9857585/  \n",
       "20   https://www.liepin.com/company/9857585/  \n",
       "21   https://www.liepin.com/company/3944024/  \n",
       "22                                            \n",
       "23                                            \n",
       "24                                            \n",
       "25                                            \n",
       "26                                            \n",
       "27                                            \n",
       "28                                            \n",
       "29                                            \n",
       "30                                            \n",
       "31                                            \n",
       "32                                            \n",
       "33                                            \n",
       "34                                            \n",
       "35                                            \n",
       "36                                            \n",
       "37                                            \n",
       "38                                            \n",
       "39                                            "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# Import pandas and the requests-html session class\n",
    "import pandas as pd\n",
    "from requests_html import HTMLSession\n",
    "\n",
    "def requests_liepin(url, params=None):\n",
    "    # Fetch a Liepin page with the module-level session created below;\n",
    "    # params is passed through as the query string\n",
    "    return session.get(url, params=params)\n",
    "\n",
    "url = \"https://www.liepin.com/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_prime&d_ckId=c08334cb0fe32408884357450c5a9ead&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead\"\n",
    "session = HTMLSession()\n",
    "r = session.get(url)\n",
    "\n",
    "# Select the job-listing elements first, then refine within each one\n",
    "主要元素 = r.html.xpath('//ul[@class=\"sojob-list\"]/li')\n",
    "\n",
    "\n",
    "# XPath dictionary, grouped by how the value is extracted\n",
    "dict_xpaths = {\n",
    "    'text': {\n",
    "        'edu':      '//div[contains(@class,\"job-info\")]/p/span[@class=\"edu\"]',\n",
    "        '经验':      '//div[contains(@class,\"job-info\")]/p/span[@class=\"edu\"]/following-sibling::span',\n",
    "        '薪水':    '//div[contains(@class,\"job-info\")]/p/span[@class=\"text-warning\"]',\n",
    "        '时间':    '//div[contains(@class,\"job-info\")]/p/time/@title',\n",
    "        '职称':    '//div[contains(@class,\"job-info\")]/h3/a',\n",
    "        '公司地点': '//div[contains(@class,\"job-info\")]/p/a',\n",
    "        '公司名称': '//div[contains(@class,\"sojob-item-main\")]//p[@class=\"company-name\"]/a',\n",
    "    },\n",
    "    'text_content': {\n",
    "    },\n",
    "    'href': {\n",
    "        '链结':    '//div[contains(@class,\"job-info\")]/h3/a',\n",
    "        '公司URL': '//div[contains(@class,\"sojob-item-main\")]//p[@class=\"company-name\"]/a',\n",
    "    }\n",
    "}\n",
    "\n",
    "# Extraction helpers; each returns one value per job-listing element\n",
    "def get_e_text_content(_xpath_):\n",
    "    # Full text content of the first match in each element\n",
    "    暂存结果 = [e.xpath(_xpath_)[0].lxml.text_content() for e in 主要元素]\n",
    "    return 暂存结果\n",
    "\n",
    "def get_e_text(_xpath_):\n",
    "    # Join the stripped text of all matches; attribute XPaths return plain strings\n",
    "    暂存结果 = [\"\".join([x.strip() if isinstance(x, str) else x.text.strip() for x in e.xpath(_xpath_)]) for e in 主要元素]\n",
    "    return 暂存结果\n",
    "\n",
    "def get_e_href(_xpath_):\n",
    "    # First absolute link of the first match, or \"\" when there is no match or no link\n",
    "    暂存结果 = []\n",
    "    for e in 主要元素:\n",
    "        matched = e.xpath(_xpath_, first=True)\n",
    "        links = list(matched.absolute_links) if matched is not None else []\n",
    "        暂存结果.append(links[0] if links else \"\")\n",
    "    return 暂存结果\n",
    "\n",
    "# Apply each XPath only within the main elements\n",
    "数据字典 = {k:get_e_text_content(v) for k,v in dict_xpaths['text_content'].items()}\n",
    "数据字典.update({k:get_e_text(v) for k,v in dict_xpaths['text'].items()})\n",
    "数据字典.update({k:get_e_href(v) for k,v in dict_xpaths['href'].items()})\n",
    "\n",
    "\n",
    "# Build the table, save it as TSV and Excel, then display it\n",
    "数据 = pd.DataFrame(数据字典)\n",
    "数据.to_csv(\"大学本科毕业生在广州的产品经理就业情况.tsv\", index=False, sep='\\t', encoding='utf-8')\n",
    "数据.to_excel(\"大学本科毕业生在广州的产品经理就业情况.xlsx\", sheet_name=\"搜查结果\")\n",
    "数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "工作 产品经理-YCJ\n",
      "链接 {'https://www.liepin.com/job/1915251369.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1930013703.shtml'}\n",
      "工作 保健食品产品经理\n",
      "链接 {'https://www.liepin.com/job/1930011415.shtml'}\n",
      "工作 硬件产品经理（消费电子经验）\n",
      "链接 {'https://www.liepin.com/job/1930002519.shtml'}\n",
      "工作 车载智能调度产品经理\n",
      "链接 {'https://www.liepin.com/job/1930001481.shtml'}\n",
      "工作 产品精算岗/产品精算经理\n",
      "链接 {'https://www.liepin.com/job/1913213116.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929967675.shtml'}\n",
      "工作 车联网产品经理\n",
      "链接 {'https://www.liepin.com/job/1929562157.shtml'}\n",
      "工作 英语产品经理（保险核心系统方向）\n",
      "链接 {'https://www.liepin.com/job/1929897199.shtml'}\n",
      "工作 快消产品经理\n",
      "链接 {'https://www.liepin.com/job/1929876643.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929756983.shtml'}\n",
      "工作 产品经理（财务方向）\n",
      "链接 {'https://www.liepin.com/job/1929604271.shtml'}\n",
      "工作 内存产品经理（内存）\n",
      "链接 {'https://www.liepin.com/job/1929384471.shtml'}\n",
      "工作 餐饮产品经理人\n",
      "链接 {'https://www.liepin.com/job/1928842873.shtml'}\n",
      "工作 产品策略企划主管/经理（北京）\n",
      "链接 {'https://www.liepin.com/job/1928239299.shtml'}\n",
      "工作 医疗器械产品经理\n",
      "链接 {'https://www.liepin.com/job/1926265681.shtml'}\n",
      "工作 HR产品经理(J12892)\n",
      "链接 {'https://www.liepin.com/job/1929939743.shtml'}\n",
      "工作 产品经理(投融资产品方向)\n",
      "链接 {'https://www.liepin.com/job/1927547055.shtml'}\n",
      "工作 PM\n",
      "链接 {'https://www.liepin.com/job/1924152831.shtml'}\n",
      "工作 临床研究项目经理（PM）\n",
      "链接 {'https://www.liepin.com/job/1926218903.shtml'}\n",
      "工作 高级社交产品经理\n",
      "链接 {'https://www.liepin.com/job/1929970195.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929964365.shtml'}\n",
      "工作 无线产品解决方案经理\n",
      "链接 {'https://www.liepin.com/job/1929524637.shtml'}\n",
      "工作 小家电产品经理\n",
      "链接 {'https://www.liepin.com/job/1927108427.shtml'}\n",
      "工作 软件产品经理-上海\n",
      "链接 {'https://www.liepin.com/job/1924393391.shtml'}\n",
      "工作 mes产品经理\n",
      "链接 {'https://www.liepin.com/job/1929949425.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929936109.shtml'}\n",
      "工作 高级产品经理\n",
      "链接 {'https://www.liepin.com/job/1928542207.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1926802527.shtml'}\n",
      "工作 电商产品经理（产品开发）\n",
      "链接 {'https://www.liepin.com/job/1928587613.shtml'}\n",
      "工作 全国市场产品经理\n",
      "链接 {'https://www.liepin.com/job/1929861555.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929882811.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1929199499.shtml'}\n",
      "工作 商业产品经理\n",
      "链接 {'https://www.liepin.com/a/20581509.shtml'}\n",
      "工作 产品经理（接受非互联网转型）\n",
      "链接 {'https://www.liepin.com/job/1929795793.shtml'}\n",
      "工作 资深产品经理（C端用户增长方向）\n",
      "链接 {'https://www.liepin.com/job/1929677019.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1927313225.shtml'}\n",
      "工作 电网销渠道产品&项目经理\n",
      "链接 {'https://www.liepin.com/job/1929544055.shtml'}\n",
      "工作 产品经理\n",
      "链接 {'https://www.liepin.com/job/1927290297.shtml'}\n",
      "工作 Product Manager/Senior Product Manager\n",
      "链接 {'https://www.liepin.com/job/1929542961.shtml'}\n"
     ]
    }
   ],
   "source": [
    "from requests_html import HTMLSession\n",
    "\n",
    "session = HTMLSession()\n",
    "\n",
    "r = session.get(\"https://www.liepin.com/zhaopin/?key=产品经理\")\n",
    "\n",
    "# 通过xpath选择器找到工作标签\n",
    "news = r.html.xpath('//div[@class=\"job-info\"]/h3/a')\n",
    "\n",
    "for new in news:\n",
    "    print(\"工作\",new.text)  # 获得工作标题\n",
    "    print(\"链接\",new.absolute_links)  # 获得工作链接"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 翻页：参数字典的拆解\n",
    "## xpath解析翻页a/@href"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Element 'a' class=('first', 'disabled') href='javascript:;' title='首页'>, <Element 'a' class=('disabled',) href='javascript:;'>, <Element 'a' class=('current',) href='javascript:;'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=2'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=3'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=4'>, <Element 'a' 
href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1'>, <Element 'a' class=('last',) href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=9' title='末页'>]\n"
     ]
    }
   ],
   "source": [
    "# 单一页面\n",
    "url = \"https://www.liepin.com/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_prime&d_ckId=c08334cb0fe32408884357450c5a9ead&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead\"\n",
    "session = HTMLSession()\n",
    "r = session.get( url )\n",
    "\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a'\n",
    "print (r.html.xpath(xpath_翻页a)) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=2'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=3'>, <Element 'a' href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=4'>, <Element 'a' 
href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1'>, <Element 'a' class=('last',) href='/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=9' title='末页'>]\n",
      "{'2': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1', '3': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=2', '4': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=3', '5': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=4', '下一页': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1', '': 
'/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=9'}\n"
     ]
    }
   ],
   "source": [
    "# A-1  xpath 解析翻页a/@href\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a' # 有disabled, current等href是javascript\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a[starts-with(@href,\"/zhaopin\")]'\n",
    "print (r.html.xpath(xpath_翻页a)) # 物件\n",
    "\n",
    "href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]\n",
    "# print (href_列表)\n",
    "\n",
    "文字_列表 = [x.text for x in r.html.xpath(xpath_翻页a)]\n",
    "# print (文字_列表)\n",
    "\n",
    "href_字典 = {x.text:x.xpath('//@href')[0]  for x in r.html.xpath(xpath_翻页a)}\n",
    "print (href_字典)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 建构参数模板：找到关键参数及参数结构"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scheme</th>\n",
       "      <th>netloc</th>\n",
       "      <th>path</th>\n",
       "      <th>params</th>\n",
       "      <th>query</th>\n",
       "      <th>fragment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>/zhaopin/</td>\n",
       "      <td></td>\n",
       "      <td>compkind=&amp;dqs=050020&amp;pubTime=&amp;pageSize=40&amp;sala...</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  scheme netloc       path params  \\\n",
       "0                /zhaopin/          \n",
       "1                /zhaopin/          \n",
       "2                /zhaopin/          \n",
       "3                /zhaopin/          \n",
       "4                /zhaopin/          \n",
       "5                /zhaopin/          \n",
       "\n",
       "                                               query fragment  \n",
       "0  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           \n",
       "1  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           \n",
       "2  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           \n",
       "3  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           \n",
       "4  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           \n",
       "5  compkind=&dqs=050020&pubTime=&pageSize=40&sala...           "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "scheme      1\n",
      "netloc      1\n",
      "path        1\n",
      "params      1\n",
      "query       5\n",
      "fragment    1\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dqs</th>\n",
       "      <th>pageSize</th>\n",
       "      <th>sortFlag</th>\n",
       "      <th>key</th>\n",
       "      <th>siTag</th>\n",
       "      <th>d_sfrom</th>\n",
       "      <th>d_ckId</th>\n",
       "      <th>d_curPage</th>\n",
       "      <th>d_pageSize</th>\n",
       "      <th>d_headId</th>\n",
       "      <th>curPage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>050020</td>\n",
       "      <td>40</td>\n",
       "      <td>15°radeFlag=0</td>\n",
       "      <td>产品经理</td>\n",
       "      <td>i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ</td>\n",
       "      <td>search_prime</td>\n",
       "      <td>9505ccc598277c1e0f848712aa43bca9</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>c08334cb0fe32408884357450c5a9ead</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      dqs pageSize       sortFlag   key  \\\n",
       "0  050020       40  15°radeFlag=0  产品经理   \n",
       "1  050020       40  15°radeFlag=0  产品经理   \n",
       "2  050020       40  15°radeFlag=0  产品经理   \n",
       "3  050020       40  15°radeFlag=0  产品经理   \n",
       "4  050020       40  15°radeFlag=0  产品经理   \n",
       "5  050020       40  15°radeFlag=0  产品经理   \n",
       "\n",
       "                                           siTag       d_sfrom  \\\n",
       "0  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "1  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "2  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "3  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "4  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "5  i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ  search_prime   \n",
       "\n",
       "                             d_ckId d_curPage d_pageSize  \\\n",
       "0  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "1  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "2  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "3  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "4  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "5  9505ccc598277c1e0f848712aa43bca9         0         40   \n",
       "\n",
       "                           d_headId curPage  \n",
       "0  c08334cb0fe32408884357450c5a9ead       1  \n",
       "1  c08334cb0fe32408884357450c5a9ead       2  \n",
       "2  c08334cb0fe32408884357450c5a9ead       3  \n",
       "3  c08334cb0fe32408884357450c5a9ead       4  \n",
       "4  c08334cb0fe32408884357450c5a9ead       1  \n",
       "5  c08334cb0fe32408884357450c5a9ead       9  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dqs           1\n",
      "pageSize      1\n",
      "sortFlag      1\n",
      "key           1\n",
      "siTag         1\n",
      "d_sfrom       1\n",
      "d_ckId        1\n",
      "d_curPage     1\n",
      "d_pageSize    1\n",
      "d_headId      1\n",
      "curPage       5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# 需要模组库\n",
    "from urllib.parse import urlparse, parse_qs\n",
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "# 总体目标：输入 href_列表, 建构出参数字典\n",
    "\n",
    "# urlparse 解析后丢入数据框\n",
    "df = pd.DataFrame([ urlparse(x) for x in href_列表])\n",
    "df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])\n",
    "\n",
    "display(df)\n",
    "print(df.nunique())\n",
    "display(df_qs)\n",
    "print(df_qs.nunique())\n",
    "\n",
    "df_qs.curPage\n",
    "df_qs = df_qs.assign (curPage_int=df_qs.curPage.astype(int)) # 变成整数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'dqs': ['050020'], 'pageSize': ['40'], 'sortFlag': ['15°radeFlag=0'], 'key': ['产品经理'], 'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'], 'd_sfrom': ['search_prime'], 'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'], 'd_curPage': ['0'], 'd_pageSize': ['40'], 'd_headId': ['c08334cb0fe32408884357450c5a9ead'], 'curPage': ['1']}\n",
      "{'2': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1', '3': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=2', '4': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=3', '5': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=4', '下一页': '/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=1', '': 
'/zhaopin/?compkind=&dqs=050020&pubTime=&pageSize=40&salary=&compTag=&sortFlag=15°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7E_FrslumzzaHrHe3aSW0VTQ&d_sfrom=search_prime&d_ckId=9505ccc598277c1e0f848712aa43bca9&d_curPage=0&d_pageSize=40&d_headId=c08334cb0fe32408884357450c5a9ead&curPage=9'}\n"
     ]
    }
   ],
   "source": [
    "def parse_url_qs_for_curPage (url):\n",
    "    six_parts = urlparse(url) \n",
    "    out = parse_qs(six_parts.query)\n",
    "    return (out)\n",
    "\n",
    "# 取一例做模板\n",
    "参数模板 = parse_url_qs_for_curPage(href_列表[0])\n",
    "print (参数模板)\n",
    "\n",
    "print (href_字典)"
   ]
  },
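  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A parameter template like the one printed above could be turned back into request URLs by overwriting `curPage` and re-encoding the query string. The following is only a minimal sketch using the standard library: `build_page_url`, `sample_href` and the `base` default are hypothetical names introduced for illustration, and the sample href is a shortened stand-in for the real pagination links. It assumes `curPage` is zero-based, which is what the link labelled `2` carrying `curPage=1` suggests.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlparse, parse_qs, urlencode, urlunparse\n",
    "\n",
    "def build_page_url(href, page, base='https://www.liepin.com'):\n",
    "    # Split the relative href into its six components\n",
    "    parts = urlparse(href)\n",
    "    # parse_qs drops blank parameters by default and wraps every value in a list\n",
    "    query = parse_qs(parts.query)\n",
    "    # Overwrite only the page index; every other parameter is kept unchanged\n",
    "    query['curPage'] = [str(page)]\n",
    "    # doseq=True writes the one-element lists back as plain key=value pairs\n",
    "    new_query = urlencode(query, doseq=True)\n",
    "    return base + urlunparse(parts._replace(query=new_query))\n",
    "\n",
    "# Shortened, made-up href with the same shape as the pagination links\n",
    "sample_href = '/zhaopin/?compkind=&dqs=050020&pageSize=40&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&curPage=1'\n",
    "page_urls = [build_page_url(sample_href, p) for p in range(3)]\n",
    "for u in page_urls:\n",
    "    print(u)\n",
    "```\n",
    "\n",
    "Because `parse_qs` discards blank parameters such as `compkind=`, the rebuilt URLs are shorter than the originals. Passing `keep_blank_values=True` to `parse_qs` would preserve them if the site turns out to need them."
   ]
  },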
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "9\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{0: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [0],\n",
       "  'keyword': ['薪水']},\n",
       " 1: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [1],\n",
       "  'keyword': ['薪水']},\n",
       " 2: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [2],\n",
       "  'keyword': ['薪水']},\n",
       " 3: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [3],\n",
       "  'keyword': ['薪水']},\n",
       " 4: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [4],\n",
       "  'keyword': ['薪水']},\n",
       " 5: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [5],\n",
       "  'keyword': ['薪水']},\n",
       " 6: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [6],\n",
       "  'keyword': ['薪水']},\n",
       " 7: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [7],\n",
       "  'keyword': ['薪水']},\n",
       " 8: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [8],\n",
       "  'keyword': ['薪水']},\n",
       " 9: {'dqs': ['050020'],\n",
       "  'pageSize': ['40'],\n",
       "  'sortFlag': ['15°radeFlag=0'],\n",
       "  'key': ['产品经理'],\n",
       "  'siTag': ['i9Jq-FcUGTpC9QESjC5G3Q~_FrslumzzaHrHe3aSW0VTQ'],\n",
       "  'd_sfrom': ['search_prime'],\n",
       "  'd_ckId': ['9505ccc598277c1e0f848712aa43bca9'],\n",
       "  'd_curPage': ['0'],\n",
       "  'd_pageSize': ['40'],\n",
       "  'd_headId': ['c08334cb0fe32408884357450c5a9ead'],\n",
       "  'curPage': [9],\n",
       "  'keyword': ['薪水']}}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Build a parameter generator from the template; it sets keyword and curPage\n",
    "def 参数模板生成(keyword, curPage):\n",
    "    参数 = 参数模板.copy()\n",
    "    参数['curPage'] = curPage\n",
    "    参数['keyword'] = keyword\n",
    "    return (参数)\n",
    "\n",
    "参数_keyword_产品经理_curPage = { \n",
    "    i:参数模板生成(curPage = [i], \\\n",
    "                  keyword = ['薪水']) \\\n",
    "    for i,v in href_字典.items()\\\n",
    "    }\n",
    "\n",
    "# print(参数_keyword_产品经理_curPage) # only builds the pagination URLs linked on this page; it neither extrapolates up to &curPage=9 nor includes the current page\n",
    "\n",
    "print (df_qs.curPage_int.min()) # the minimum is only 1\n",
    "print (df_qs.curPage_int.max()) # the maximum is only 9\n",
    "\n",
    "# The range should run from 0 (the current page) to 9 (the maximum)\n",
    "\n",
    "参数_keyword_产品经理_curPage = { \n",
    "    i:参数模板生成(curPage = [i], \\\n",
    "                  keyword = ['薪水']) \\\n",
    "    for i in range(0,df_qs.curPage_int.max()+1)\\\n",
    "    }\n",
    "参数_keyword_产品经理_curPage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 6.14 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import time\n",
    "from random import random\n",
    "time.sleep(3+4*random())"
   ]
  },
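  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pagination: backing up and merging data\n",
    "When running over multiple pages and multiple keywords, back up each page's df as an intermediate checkpoint in case the run is interrupted."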
   ]
  },
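  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how per-page backups could also make the run resumable. It uses only the standard library; the helper names `checkpoint_path` and `pending_pages` and the file naming scheme are assumptions for illustration, not part of this notebook's code.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def checkpoint_path(folder, key, page):\n",
    "    # Hypothetical naming scheme: one backup file per keyword and page\n",
    "    return Path(folder) / '{key}_page{page}.tsv'.format(key=key, page=page)\n",
    "\n",
    "\n",
    "def pending_pages(folder, key, n_pages):\n",
    "    # Pages 0..n_pages-1 that have no backup file yet\n",
    "    return [p for p in range(n_pages)\n",
    "            if not checkpoint_path(folder, key, p).exists()]\n",
    "```\n",
    "\n",
    "After an interruption, looping over `pending_pages(folder, key, 长度)` instead of `range(0, 长度)` would skip the pages that were already saved."
   ]
  },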
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "公司名称 10\n"
     ]
    },
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'to_csv'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32m<timed exec>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n",
      "\u001b[1;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'to_csv'"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# Multiple pages x multiple keywords\n",
    "import time\n",
    "from random import random\n",
    "from urllib.parse import urlparse, parse_qs\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "url = \"https://www.liepin.com/zhaopin/\"\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a[starts-with(@href,\"/zhaopin\")]'\n",
    "\n",
    "keywords = ['公司名称','edu','薪水','产品经理']\n",
    "list_df = list()\n",
    "\n",
    "for key in keywords:\n",
    "    ## Probe the first page to find out how many pages there are\n",
    "    payload = 参数模板生成(keyword=[key], curPage=[\"0\"])\n",
    "    df = requests_liepin(url, params = payload)\n",
    "    # NOTE: r is the response object left over from an earlier cell; this loop does not refresh it\n",
    "    href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]\n",
    "    df = pd.DataFrame([ urlparse(x) for x in href_列表])\n",
    "    df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])\n",
    "    df_qs = df_qs.assign (curPage_int=df_qs.curPage.astype(int)) # convert to int\n",
    "    长度 = df_qs.curPage_int.max()+1\n",
    "    参数_keyword_X_curPage = {\n",
    "        i: 参数模板生成(curPage = [i], keyword = [key])\n",
    "        for i in range(0,长度)\n",
    "        }\n",
    "    #print (参数_keyword_X_curPage)\n",
    "    print (key,长度)\n",
    "    \n",
    "    for k,v in 参数_keyword_X_curPage.items():\n",
    "        payload = v\n",
    "        df = requests_liepin( url, params = payload)\n",
    "        time.sleep(3+4*random())  # slow down: 3-7 s, about 5 s on average\n",
    "        if df is None:  # requests_liepin returned nothing (the cause of the 'NoneType' error); skip this page\n",
    "            continue\n",
    "        ## Backup: one file per keyword and page, so earlier backups are not overwritten\n",
    "        df.to_csv(\"产品经理在广州的就业需求情况_{key}_{k}.tsv\"\n",
    "                  .format(key=key, k=k), sep=\"\\t\", encoding=\"utf-8\")\n",
    "        \n",
    "        df = df.assign (keyword = key)  # tag rows with the keyword\n",
    "        df = df.assign (curPage = k)  # tag rows with the page number\n",
    "        list_df.append(df)\n",
    "        \n",
    "df_all = pd.concat(list_df).reset_index()\n",
    "df_all.index.name = '序'\n",
    "\n",
    "df_all.to_excel(\"产品经理在广州的就业需求情况_翻页.xlsx\",\n",
    "                sheet_name=\"_\".join(keywords))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing URL parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tmes产品经理 ': 'https://www.liepin.com/job/1929949425.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品经理 ': 'https://www.liepin.com/job/1927290297.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tHR产品经理(J12892) ': 'https://www.liepin.com/job/1929939743.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t英语产品经理（保险核心系统方向） ': 'https://www.liepin.com/job/1929897199.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t保健食品产品经理 ': 'https://www.liepin.com/job/1930011415.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t车联网产品经理 ': 'https://www.liepin.com/job/1929562157.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t硬件产品经理（消费电子经验） ': 'https://www.liepin.com/job/1930002519.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t车载智能调度产品经理 ': 'https://www.liepin.com/job/1930001481.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t快消产品经理 ': 'https://www.liepin.com/job/1929876643.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品经理（财务方向） ': 'https://www.liepin.com/job/1929604271.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t内存产品经理（内存） ': 'https://www.liepin.com/job/1929384471.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t餐饮产品经理人 ': 'https://www.liepin.com/job/1928842873.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品策略企划主管/经理（北京） ': 'https://www.liepin.com/job/1928239299.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t医疗器械产品经理 ': 'https://www.liepin.com/job/1926265681.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品精算岗/产品精算经理 ': 'https://www.liepin.com/job/1913213116.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品经理(投融资产品方向) ': 'https://www.liepin.com/job/1927547055.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tPM ': 'https://www.liepin.com/job/1924152831.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t临床研究项目经理（PM） ': 'https://www.liepin.com/job/1926218903.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t高级社交产品经理 ': 'https://www.liepin.com/job/1929970195.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t无线产品解决方案经理 ': 'https://www.liepin.com/job/1929524637.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t小家电产品经理 ': 'https://www.liepin.com/job/1927108427.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t软件产品经理-上海 ': 'https://www.liepin.com/job/1924393391.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t高级产品经理 ': 'https://www.liepin.com/job/1928542207.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t电商产品经理（产品开发） ': 'https://www.liepin.com/job/1928587613.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t全国市场产品经理 ': 'https://www.liepin.com/job/1929861555.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t商业产品经理 ': '/a/20581509.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品经理（接受非互联网转型） ': 'https://www.liepin.com/job/1929795793.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t资深产品经理（C端用户增长方向） ': 'https://www.liepin.com/job/1929677019.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t电网销渠道产品&项目经理 ': 'https://www.liepin.com/job/1929544055.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tProduct Manager/Senior Product Manager ': 'https://www.liepin.com/job/1929542961.shtml',\n",
       " '\\r\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t产品经理-YCJ ': 'https://www.liepin.com/job/1915251369.shtml'}"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from requests_html import HTMLSession\n",
    "url = \"https://www.liepin.com/zhaopin/?key=产品经理\"\n",
    "session = HTMLSession()\n",
    "r = session.get(url)\n",
    "\n",
    "产品经理细分 = r.html.xpath(\"//div[@class='job-info']/h3/a\")\n",
    "行业字典 = {a.xpath(\"a/text()\")[0]: a.xpath(\"a/@href\")[0] for a in 产品经理细分}\n",
    "行业字典"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parsing data with urllib.parse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929949425.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1927290297.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929939743.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929897199.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1930011415.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929562157.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1930002519.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1930001481.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929876643.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929604271.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929384471.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1928842873.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1928239299.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1926265681.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1913213116.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1927547055.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1924152831.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1926218903.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929970195.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929524637.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1927108427.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1924393391.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1928542207.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1928587613.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929861555.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='', netloc='', path='/a/20581509.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929795793.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929677019.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929544055.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1929542961.shtml', params='', query='', fragment=''),\n",
       " ParseResult(scheme='https', netloc='www.liepin.com', path='/job/1915251369.shtml', params='', query='', fragment='')]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from urllib.parse import urlparse, parse_qs\n",
    "[ urlparse(x) for x in 行业字典.values()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 31 entries, 0 to 30\n",
      "Data columns (total 6 columns):\n",
      " #   Column    Non-Null Count  Dtype \n",
      "---  ------    --------------  ----- \n",
      " 0   scheme    31 non-null     object\n",
      " 1   netloc    31 non-null     object\n",
      " 2   path      31 non-null     object\n",
      " 3   params    31 non-null     object\n",
      " 4   query     31 non-null     object\n",
      " 5   fragment  31 non-null     object\n",
      "dtypes: object(6)\n",
      "memory usage: 1.6+ KB\n",
      "scheme       2\n",
      "netloc       2\n",
      "path        31\n",
      "params       1\n",
      "query        1\n",
      "fragment     1\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>scheme</th>\n",
       "      <th>netloc</th>\n",
       "      <th>path</th>\n",
       "      <th>params</th>\n",
       "      <th>query</th>\n",
       "      <th>fragment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>https</td>\n",
       "      <td>www.liepin.com</td>\n",
       "      <td>/job/1929949425.shtml</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  scheme          netloc                   path params query fragment\n",
       "0  https  www.liepin.com  /job/1929949425.shtml                      "
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame([urlparse(x) for x in 行业字典.values()])\n",
    "df.info()\n",
    "print(df.nunique())\n",
    "df.head(1)"
   ]
  },
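  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `nunique()` output shows two distinct values for `scheme` and `netloc` because one scraped link (`/a/20581509.shtml`) is relative. A minimal sketch of normalising such links with `urllib.parse.urljoin`, assuming the search page URL is the right base; `BASE_URL` and `absolutize` are illustrative names, not part of this notebook.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "# Assumed base: the search page the links were scraped from\n",
    "BASE_URL = 'https://www.liepin.com/zhaopin/'\n",
    "\n",
    "\n",
    "def absolutize(href, base=BASE_URL):\n",
    "    # Absolute URLs pass through unchanged; relative ones are resolved against base\n",
    "    return urljoin(base, href)\n",
    "\n",
    "\n",
    "print(absolutize('/a/20581509.shtml'))\n",
    "print(absolutize('https://www.liepin.com/job/1929949425.shtml'))\n",
    "```\n",
    "\n",
    "Applying this to `行业字典.values()` before calling `urlparse` would give every row the same `scheme` and `netloc`."
   ]
  },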
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XHR requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Element 'a' href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=1'>, <Element 'a' href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=2'>, <Element 'a' href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=3'>, <Element 'a' href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=4'>, <Element 'a' href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=1'>, <Element 
'a' class=('last',) href='/zhaopin/?compkind=&dqs=&pubTime=&pageSize=40&salary=&compTag=&sortFlag=°radeFlag=0&compIds=&subIndustry=&jobKind=&industries=&compscale=&key=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&siTag=i9Jq-FcUGTpC9QESjC5G3Q%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=cc7e6b1f85d4bd4c7eea948ffab0f263&d_curPage=0&d_pageSize=40&d_headId=cc7e6b1f85d4bd4c7eea948ffab0f263&curPage=9' title='末页'>]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from requests_html import HTMLSession\n",
    "\n",
    "url = \"https://www.liepin.com/zhaopin/?key=产品经理\"\n",
    "session = HTMLSession()\n",
    "r = session.get( url )\n",
    "\n",
    "# First attempt (overridden on the next assignment): also matches disabled/current links whose href is javascript\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a'\n",
    "# Keep only the links that point to /zhaopin result pages\n",
    "xpath_翻页a = '//div[@class=\"pagerbar\"]/a[starts-with(@href,\"/zhaopin\")]'\n",
    "print (r.html.xpath(xpath_翻页a))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Scrapy framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.2.1'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import scrapy\n",
    "scrapy.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'q': ['产品经理'],\n",
       " 'qs': ['n'],\n",
       " 'form': ['QBLHCN'],\n",
       " 'sp': ['-1'],\n",
       " 'pq': ['产品经理'],\n",
       " 'sc': ['0-4'],\n",
       " 'cvid': ['36D9D2A8DE934DD3885F8E393AEF3B28']}"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "ParseResult(scheme='https', netloc='cn.bing.com', path='/search', params='', query='q=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&qs=n&form=QBLHCN&sp=-1&pq=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&sc=0-4&sk=&cvid=36D9D2A8DE934DD3885F8E393AEF3B28', fragment='')"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "url = \"https://cn.bing.com/search?q=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&qs=n&form=QBLHCN&sp=-1&pq=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&sc=0-4&sk=&cvid=36D9D2A8DE934DD3885F8E393AEF3B28\"\n",
    "\n",
    "from urllib.parse import urlparse, parse_qs, urlencode    # URL handling from urllib\n",
    "\n",
    "from time import time        # time() supplies the ts timestamp (this shadows the time module imported in earlier cells)\n",
    "\n",
    "def parse_url_qs (url):\n",
    "    six_parts = urlparse(url)       # input: url\n",
    "    out = parse_qs(six_parts.query)\n",
    "    return (out, six_parts)        # output: the parsed query dict and the six URL components\n",
    "\n",
    "参数, six = parse_url_qs (url)\n",
    "display(参数)\n",
    "display(six)\n",
    "\n",
    "参_template = 参数.copy()\n",
    "\n",
    "# Image parameter template\n",
    "def get_url_byN(N):\n",
    "    参_template['idx'] = [str(N)]\n",
    "    参_template['ts'] = [int(time())]\n",
    "    q = urlencode(参_template, doseq = True) # doseq=True encodes each list element as its own name=value pair\n",
    "    six_new = six._replace(query=q)\n",
    "    u = six_new.geturl()\n",
    "    return u"
   ]
  },
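  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `parse_qs` drops parameters with blank values by default, which is why `sk` is absent from the parsed dictionary above even though the URL contains `sk=`. A hedged sketch of a round trip that keeps such parameters; `set_query_param` is a hypothetical helper and `example.com` is a placeholder URL.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urlparse, parse_qs, urlencode\n",
    "\n",
    "\n",
    "def set_query_param(url, name, value):\n",
    "    parts = urlparse(url)\n",
    "    # keep_blank_values=True preserves parameters such as 'sk=' that parse_qs drops by default\n",
    "    qs = parse_qs(parts.query, keep_blank_values=True)\n",
    "    qs[name] = [str(value)]\n",
    "    # doseq=True writes each list element as its own name=value pair\n",
    "    return parts._replace(query=urlencode(qs, doseq=True)).geturl()\n",
    "\n",
    "\n",
    "print(set_query_param('https://example.com/search?q=a&sk=', 'page', 2))\n",
    "```\n",
    "\n",
    "Returning a new URL rather than mutating a shared template keeps each call independent of the previous ones."
   ]
  },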
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a Scrapy project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New Scrapy project 'QiMoxmu', using template directory 'c:\\programdata\\anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:\n",
      "    D:\\网络与新媒体2020-学习资料\\Web数据挖掘\\QiMoxmu\\QiMoxmu\n",
      "\n",
      "You can start your first spider with:\n",
      "    cd QiMoxmu\n",
      "    scrapy genspider example example.com\n"
     ]
    }
   ],
   "source": [
    "# B1: scrapy startproject <project name>\n",
    "! scrapy startproject QiMoxmu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "文件夹 PATH 列表\n",
      "卷序列号为 1222-7B9C\n",
      "D:.\n",
      "│  Python数据挖掘.docx\n",
      "│  大学本科毕业生在广州的产品经理就业情况.tsv\n",
      "│  数据挖掘期末项目.ipynb\n",
      "│  \n",
      "├─.ipynb_checkpoints\n",
      "└─QiMoxmu\n",
      "    │  scrapy.cfg\n",
      "    │  \n",
      "    └─QiMoxmu\n",
      "        │  items.py\n",
      "        │  middlewares.py\n",
      "        │  pipelines.py\n",
      "        │  settings.py\n",
      "        │  __init__.py\n",
      "        │  \n",
      "        └─spiders\n",
      "                __init__.py\n",
      "                \n"
     ]
    }
   ],
   "source": [
    "# B2: tree /f shows the project directory structure\n",
    "! tree/f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 驱动器 D 中的卷没有标签。\n",
      " 卷的序列号是 1222-7B9C\n",
      "\n",
      " D:\\网络与新媒体2020-学习资料\\Web数据挖掘\\QiMoxmu 的目录\n",
      "\n",
      "2020/07/19  16:46    <DIR>          .\n",
      "2020/07/19  16:46    <DIR>          ..\n",
      "2020/07/19  16:42    <DIR>          .ipynb_checkpoints\n",
      "2020/07/19  00:51            18,867 Python数据挖掘.docx\n",
      "2020/07/19  16:46    <DIR>          QiMoxmu\n",
      "2020/07/19  13:02             8,017 大学本科毕业生在广州的产品经理就业情况.tsv\n",
      "2020/07/19  16:41            98,062 数据挖掘期末项目.ipynb\n",
      "               3 个文件        124,946 字节\n",
      "               4 个目录 832,645,165,056 可用字节\n"
     ]
    }
   ],
   "source": [
    "! dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a spider inside the Scrapy project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Created spider 'BingImageSpider' using template 'basic' \n"
     ]
    }
   ],
   "source": [
    "# genspider takes a spider name and a domain (not a full URL)\n",
    "! scrapy genspider BingImageSpider www.liepin.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scrapy 2.2.1 - no active project\n",
      "\n",
      "Usage:\n",
      "  scrapy <command> [options] [args]\n",
      "\n",
      "Available commands:\n",
      "  bench         Run quick benchmark test\n",
      "  commands      \n",
      "  fetch         Fetch a URL using the Scrapy downloader\n",
      "  genspider     Generate new spider using pre-defined templates\n",
      "  runspider     Run a self-contained spider (without creating a project)\n",
      "  settings      Get settings values\n",
      "  shell         Interactive scraping console\n",
      "  startproject  Create new project\n",
      "  version       Print Scrapy version\n",
      "  view          Open URL in browser, as seen by Scrapy\n",
      "\n",
      "  [ more ]      More commands available when run from project directory\n",
      "\n",
      "Use \"scrapy <command> -h\" to see more info about a command\n"
     ]
    }
   ],
   "source": [
    "! scrapy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "D:\\网络与新媒体2020-学习资料\\Web数据挖掘\\QiMoxmu\n"
     ]
    }
   ],
   "source": [
    "! cd"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
